With the improvement of technology, the number of people falling victim to fraud and scams has also been growing. By learning transaction patterns and unusual activity from historical data, we can help both individuals and organizations avoid becoming victims of fraud.
The dataset we are working with contains transaction records for the period 1 Jan 2019 - 31 Dec 2020, a mix of legitimate and fraudulent transactions; its columns are shown in the preview below.
The dataset includes an "is_fraud" label. We train our models on the training set and then evaluate them on the test set to check how well each model distinguishes fraudulent transactions from legitimate ones.
The dataset has already been split into train and test sets.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
import warnings
warnings.filterwarnings(action='ignore')
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn import neighbors, tree, naive_bayes, ensemble
from sklearn.svm import SVC
from math import sqrt
import plotly.graph_objects as go
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import plot_roc_curve
train = pd.read_csv('/Users/nawaazsharif/Desktop/DePaul/Q5/DSC 540- Advanced Machine Learning/Project/archive-2/fraudTrain.csv' , index_col=False)
train
| Unnamed: 0 | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | ... | lat | long | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2019-01-01 00:00:18 | 2703186189652095 | fraud_Rippin, Kub and Mann | misc_net | 4.97 | Jennifer | Banks | F | 561 Perry Cove | ... | 36.0788 | -81.1781 | 3495 | Psychologist, counselling | 1988-03-09 | 0b242abb623afc578575680df30655b9 | 1325376018 | 36.011293 | -82.048315 | 0 |
| 1 | 1 | 2019-01-01 00:00:44 | 630423337322 | fraud_Heller, Gutmann and Zieme | grocery_pos | 107.23 | Stephanie | Gill | F | 43039 Riley Greens Suite 393 | ... | 48.8878 | -118.2105 | 149 | Special educational needs teacher | 1978-06-21 | 1f76529f8574734946361c461b024d99 | 1325376044 | 49.159047 | -118.186462 | 0 |
| 2 | 2 | 2019-01-01 00:00:51 | 38859492057661 | fraud_Lind-Buckridge | entertainment | 220.11 | Edward | Sanchez | M | 594 White Dale Suite 530 | ... | 42.1808 | -112.2620 | 4154 | Nature conservation officer | 1962-01-19 | a1a22d70485983eac12b5b88dad1cf95 | 1325376051 | 43.150704 | -112.154481 | 0 |
| 3 | 3 | 2019-01-01 00:01:16 | 3534093764340240 | fraud_Kutch, Hermiston and Farrell | gas_transport | 45.00 | Jeremy | White | M | 9443 Cynthia Court Apt. 038 | ... | 46.2306 | -112.1138 | 1939 | Patent attorney | 1967-01-12 | 6b849c168bdad6f867558c3793159a81 | 1325376076 | 47.034331 | -112.561071 | 0 |
| 4 | 4 | 2019-01-01 00:03:06 | 375534208663984 | fraud_Keeling-Crist | misc_pos | 41.96 | Tyler | Garcia | M | 408 Bradley Rest | ... | 38.4207 | -79.4629 | 99 | Dance movement psychotherapist | 1986-03-28 | a41d7549acf90789359a9aa5346dcb46 | 1325376186 | 38.674999 | -78.632459 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1296670 | 1296670 | 2020-06-21 12:12:08 | 30263540414123 | fraud_Reichel Inc | entertainment | 15.56 | Erik | Patterson | M | 162 Jessica Row Apt. 072 | ... | 37.7175 | -112.4777 | 258 | Geoscientist | 1961-11-24 | 440b587732da4dc1a6395aba5fb41669 | 1371816728 | 36.841266 | -111.690765 | 0 |
| 1296671 | 1296671 | 2020-06-21 12:12:19 | 6011149206456997 | fraud_Abernathy and Sons | food_dining | 51.70 | Jeffrey | White | M | 8617 Holmes Terrace Suite 651 | ... | 39.2667 | -77.5101 | 100 | Production assistant, television | 1979-12-11 | 278000d2e0d2277d1de2f890067dcc0a | 1371816739 | 38.906881 | -78.246528 | 0 |
| 1296672 | 1296672 | 2020-06-21 12:12:32 | 3514865930894695 | fraud_Stiedemann Ltd | food_dining | 105.93 | Christopher | Castaneda | M | 1632 Cohen Drive Suite 639 | ... | 32.9396 | -105.8189 | 899 | Naval architect | 1967-08-30 | 483f52fe67fabef353d552c1e662974c | 1371816752 | 33.619513 | -105.130529 | 0 |
| 1296673 | 1296673 | 2020-06-21 12:13:36 | 2720012583106919 | fraud_Reinger, Weissnat and Strosin | food_dining | 74.90 | Joseph | Murray | M | 42933 Ryan Underpass | ... | 43.3526 | -102.5411 | 1126 | Volunteer coordinator | 1980-08-18 | d667cdcbadaaed3da3f4020e83591c83 | 1371816816 | 42.788940 | -103.241160 | 0 |
| 1296674 | 1296674 | 2020-06-21 12:13:37 | 4292902571056973207 | fraud_Langosh, Wintheiser and Hyatt | food_dining | 4.30 | Jeffrey | Smith | M | 135 Joseph Mountains | ... | 45.8433 | -113.8748 | 218 | Therapist, horticultural | 1995-08-16 | 8f7c8e4ab7f25875d753b422917c98c9 | 1371816817 | 46.565983 | -114.186110 | 0 |
1296675 rows × 23 columns
Keeping all rows where 'is_fraud' == 1, selecting a smaller subset of the rows where 'is_fraud' == 0, and then concatenating the two to reduce the class imbalance before applying the machine learning algorithms.
is_fraud = train[train['is_fraud'] == 1]
not_fraud = train[train['is_fraud'] == 0].iloc[:32236]
train = pd.concat([is_fraud, not_fraud], axis=0)
train=train.reset_index(drop = True)
train
| Unnamed: 0 | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | ... | lat | long | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2449 | 2019-01-02 01:06:37 | 4613314721966 | fraud_Rutherford-Mertz | grocery_pos | 281.06 | Jason | Murphy | M | 542 Steve Curve Suite 011 | ... | 35.9946 | -81.7266 | 885 | Soil scientist | 1988-09-15 | e8a81877ae9a0a7f883e15cb39dc4022 | 1325466397 | 36.430124 | -81.179483 | 1 |
| 1 | 2472 | 2019-01-02 01:47:29 | 340187018810220 | fraud_Jenkins, Hauck and Friesen | gas_transport | 11.52 | Misty | Hart | F | 27954 Hall Mill Suite 575 | ... | 29.4400 | -98.4590 | 1595797 | Horticultural consultant | 1960-10-28 | bc7d41c41103877b03232f03f1f8d3f5 | 1325468849 | 29.819364 | -99.142791 | 1 |
| 2 | 2523 | 2019-01-02 03:05:23 | 340187018810220 | fraud_Goodwin-Nitzsche | grocery_pos | 276.31 | Misty | Hart | F | 27954 Hall Mill Suite 575 | ... | 29.4400 | -98.4590 | 1595797 | Horticultural consultant | 1960-10-28 | b98f12f4168391b2203238813df5aa8c | 1325473523 | 29.273085 | -98.836360 | 1 |
| 3 | 2546 | 2019-01-02 03:38:03 | 4613314721966 | fraud_Erdman-Kertzmann | gas_transport | 7.03 | Jason | Murphy | M | 542 Steve Curve Suite 011 | ... | 35.9946 | -81.7266 | 885 | Soil scientist | 1988-09-15 | 397894a5c4c02e3c61c784001f0f14e4 | 1325475483 | 35.909292 | -82.091010 | 1 |
| 4 | 2553 | 2019-01-02 03:55:47 | 340187018810220 | fraud_Koepp-Parker | grocery_pos | 275.73 | Misty | Hart | F | 27954 Hall Mill Suite 575 | ... | 29.4400 | -98.4590 | 1595797 | Horticultural consultant | 1960-10-28 | 7863235a750d73a244c07f1fb7f0185a | 1325476547 | 29.786426 | -98.683410 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 39737 | 32564 | 2019-01-20 13:03:44 | 2274911989136158 | fraud_Champlin and Sons | home | 68.63 | Robert | Velazquez | M | 3136 Silva Stream | ... | 39.5102 | -104.7216 | 84861 | Materials engineer | 1932-03-10 | 24a5a2661499c79ebdacbb86d541280f | 1327064624 | 40.451743 | -104.709372 | 0 |
| 39738 | 32565 | 2019-01-20 13:04:16 | 3531129874770000 | fraud_Treutel-King | travel | 2.57 | Shelby | Mitchell | F | 974 Cindy Stream | ... | 43.8065 | -73.0882 | 5895 | Scientist, marine | 1975-07-13 | ca5902bca7555acc7fdab9514af9ad86 | 1327064656 | 44.066198 | -73.593901 | 0 |
| 39739 | 32566 | 2019-01-20 13:05:56 | 3576021480694169 | fraud_Runolfsdottir, Mueller and Hand | entertainment | 30.71 | Dawn | Gray | F | 9486 Joel Common Suite 554 | ... | 39.1329 | -95.7023 | 163415 | Secondary school teacher | 2004-12-30 | b31782a82be80d076b33ba7788b5cab3 | 1327064756 | 38.540769 | -95.955565 | 0 |
| 39740 | 32567 | 2019-01-20 13:06:00 | 4247921790666 | fraud_Turner LLC | travel | 3.79 | Judith | Moss | F | 46297 Benjamin Plains Suite 703 | ... | 39.5370 | -83.4550 | 22305 | Television floor manager | 1939-03-09 | 88c65b4e1585934d578511e627fe3589 | 1327064760 | 39.156673 | -82.930503 | 0 |
| 39741 | 32568 | 2019-01-20 13:09:16 | 4839043708100390 | fraud_Lesch, D'Amore and Brown | food_dining | 3.14 | Meredith | Campbell | F | 043 Hanson Turnpike | ... | 41.1826 | -92.3097 | 1583 | Geochemist | 1999-06-28 | 4c4ca360699fddc6e332fefe3ff5c75e | 1327064956 | 41.728122 | -92.651439 | 0 |
39742 rows × 23 columns
train['is_fraud'].value_counts()
0    32236
1     7506
Name: is_fraud, dtype: int64
In the undersampled training set, there are 32236 rows with 'is_fraud' == 0 and 7506 rows with 'is_fraud' == 1.
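Even after undersampling, the classes remain imbalanced at roughly 4.3:1. A minimal sketch of quantifying this, using the counts printed above:

```python
import pandas as pd

# Class counts taken from the value_counts() output above
counts = pd.Series({0: 32236, 1: 7506})

fraud_rate = counts[1] / counts.sum()  # fraction of fraud rows
imbalance = counts[0] / counts[1]      # majority-to-minority ratio
print(f"fraud rate: {fraud_rate:.3f}, imbalance: {imbalance:.1f}:1")
# → fraud rate: 0.189, imbalance: 4.3:1
```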
pd.set_option('display.max_columns', 100)
train.head()
| Unnamed: 0 | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | city | state | zip | lat | long | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2449 | 2019-01-02 01:06:37 | 4613314721966 | fraud_Rutherford-Mertz | grocery_pos | 281.06 | Jason | Murphy | M | 542 Steve Curve Suite 011 | Collettsville | NC | 28611 | 35.9946 | -81.7266 | 885 | Soil scientist | 1988-09-15 | e8a81877ae9a0a7f883e15cb39dc4022 | 1325466397 | 36.430124 | -81.179483 | 1 |
| 1 | 2472 | 2019-01-02 01:47:29 | 340187018810220 | fraud_Jenkins, Hauck and Friesen | gas_transport | 11.52 | Misty | Hart | F | 27954 Hall Mill Suite 575 | San Antonio | TX | 78208 | 29.4400 | -98.4590 | 1595797 | Horticultural consultant | 1960-10-28 | bc7d41c41103877b03232f03f1f8d3f5 | 1325468849 | 29.819364 | -99.142791 | 1 |
| 2 | 2523 | 2019-01-02 03:05:23 | 340187018810220 | fraud_Goodwin-Nitzsche | grocery_pos | 276.31 | Misty | Hart | F | 27954 Hall Mill Suite 575 | San Antonio | TX | 78208 | 29.4400 | -98.4590 | 1595797 | Horticultural consultant | 1960-10-28 | b98f12f4168391b2203238813df5aa8c | 1325473523 | 29.273085 | -98.836360 | 1 |
| 3 | 2546 | 2019-01-02 03:38:03 | 4613314721966 | fraud_Erdman-Kertzmann | gas_transport | 7.03 | Jason | Murphy | M | 542 Steve Curve Suite 011 | Collettsville | NC | 28611 | 35.9946 | -81.7266 | 885 | Soil scientist | 1988-09-15 | 397894a5c4c02e3c61c784001f0f14e4 | 1325475483 | 35.909292 | -82.091010 | 1 |
| 4 | 2553 | 2019-01-02 03:55:47 | 340187018810220 | fraud_Koepp-Parker | grocery_pos | 275.73 | Misty | Hart | F | 27954 Hall Mill Suite 575 | San Antonio | TX | 78208 | 29.4400 | -98.4590 | 1595797 | Horticultural consultant | 1960-10-28 | 7863235a750d73a244c07f1fb7f0185a | 1325476547 | 29.786426 | -98.683410 | 1 |
# Checking the shape of the training dataset.
train.shape
(39742, 23)
# Checking the datatypes of the training dataset.
train.dtypes
Unnamed: 0                 int64
trans_date_trans_time     object
cc_num                     int64
merchant                  object
category                  object
amt                      float64
first                     object
last                      object
gender                    object
street                    object
city                      object
state                     object
zip                        int64
lat                      float64
long                     float64
city_pop                   int64
job                       object
dob                       object
trans_num                 object
unix_time                  int64
merch_lat                float64
merch_long               float64
is_fraud                   int64
dtype: object
From the dtypes above, we can see that just over half of the columns (12 of 23) are object (string) variables.
# Checking the concise summary of the dataframe.
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39742 entries, 0 to 39741
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Unnamed: 0             39742 non-null  int64
 1   trans_date_trans_time  39742 non-null  object
 2   cc_num                 39742 non-null  int64
 3   merchant               39742 non-null  object
 4   category               39742 non-null  object
 5   amt                    39742 non-null  float64
 6   first                  39742 non-null  object
 7   last                   39742 non-null  object
 8   gender                 39742 non-null  object
 9   street                 39742 non-null  object
 10  city                   39742 non-null  object
 11  state                  39742 non-null  object
 12  zip                    39742 non-null  int64
 13  lat                    39742 non-null  float64
 14  long                   39742 non-null  float64
 15  city_pop               39742 non-null  int64
 16  job                    39742 non-null  object
 17  dob                    39742 non-null  object
 18  trans_num              39742 non-null  object
 19  unix_time              39742 non-null  int64
 20  merch_lat              39742 non-null  float64
 21  merch_long             39742 non-null  float64
 22  is_fraud               39742 non-null  int64
dtypes: float64(5), int64(6), object(12)
memory usage: 7.0+ MB
# Check the missing values in the dataset.
train.isnull().sum()
Unnamed: 0               0
trans_date_trans_time    0
cc_num                   0
merchant                 0
category                 0
amt                      0
first                    0
last                     0
gender                   0
street                   0
city                     0
state                    0
zip                      0
lat                      0
long                     0
city_pop                 0
job                      0
dob                      0
trans_num                0
unix_time                0
merch_lat                0
merch_long               0
is_fraud                 0
dtype: int64
There are no null values in our dataset.
# Describing all variables/columns of the dataset.
train.describe(include='all')
| Unnamed: 0 | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | city | state | zip | lat | long | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3.974200e+04 | 39742 | 3.974200e+04 | 39742 | 39742 | 39742.000000 | 39742 | 39742 | 39742 | 39742 | 39742 | 39742 | 39742.000000 | 39742.000000 | 39742.000000 | 3.974200e+04 | 39742 | 39742 | 39742 | 3.974200e+04 | 39742.000000 | 39742.000000 | 39742.000000 |
| unique | NaN | 39360 | NaN | 693 | 14 | NaN | 352 | 481 | 2 | 983 | 894 | 51 | NaN | NaN | NaN | NaN | 494 | 968 | 39742 | NaN | NaN | NaN | NaN |
| top | NaN | 2019-01-14 16:40:31 | NaN | fraud_Cormier LLC | grocery_pos | NaN | Christopher | Smith | F | 7618 Gonzales Mission | Warren | TX | NaN | NaN | NaN | NaN | Exhibition designer | 1977-03-23 | e8a81877ae9a0a7f883e15cb39dc4022 | NaN | NaN | NaN | NaN |
| freq | NaN | 3 | NaN | 150 | 4808 | NaN | 886 | 825 | 21276 | 103 | 161 | 2796 | NaN | NaN | NaN | NaN | 293 | 144 | 1 | NaN | NaN | NaN | NaN |
| mean | 1.312085e+05 | NaN | 4.158598e+17 | NaN | NaN | 155.169277 | NaN | NaN | NaN | NaN | NaN | NaN | 48640.918248 | 38.549945 | -90.181419 | 9.058398e+04 | NaN | NaN | NaN | 1.330398e+09 | 38.548398 | -90.182947 | 0.188868 |
| std | 2.953163e+05 | NaN | 1.306173e+18 | NaN | NaN | 276.879146 | NaN | NaN | NaN | NaN | NaN | NaN | 27007.726580 | 5.089440 | 13.914702 | 3.035233e+05 | NaN | NaN | NaN | 1.056781e+07 | 5.122541 | 13.933079 | 0.391409 |
| min | 0.000000e+00 | NaN | 6.041621e+10 | NaN | NaN | 1.000000 | NaN | NaN | NaN | NaN | NaN | NaN | 1257.000000 | 20.027100 | -165.672300 | 2.300000e+01 | NaN | NaN | NaN | 1.325376e+09 | 19.040141 | -166.629875 | 0.000000 |
| 25% | 9.935250e+03 | NaN | 1.800429e+14 | NaN | NaN | 13.672500 | NaN | NaN | NaN | NaN | NaN | NaN | 25526.000000 | 34.703200 | -96.790900 | 7.430000e+02 | NaN | NaN | NaN | 1.325905e+09 | 34.825385 | -96.877697 | 0.000000 |
| 50% | 1.987050e+04 | NaN | 3.521261e+15 | NaN | NaN | 56.500000 | NaN | NaN | NaN | NaN | NaN | NaN | 48043.000000 | 39.354300 | -87.410100 | 2.501000e+03 | NaN | NaN | NaN | 1.326421e+09 | 39.353395 | -87.340745 | 0.000000 |
| 75% | 2.980575e+04 | NaN | 4.642255e+15 | NaN | NaN | 118.457500 | NaN | NaN | NaN | NaN | NaN | NaN | 72011.000000 | 41.846700 | -80.128400 | 2.112500e+04 | NaN | NaN | NaN | 1.326905e+09 | 41.922669 | -80.155785 | 0.000000 |
| max | 1.295733e+06 | NaN | 4.992346e+18 | NaN | NaN | 11872.210000 | NaN | NaN | NaN | NaN | NaN | NaN | 99783.000000 | 66.693300 | -67.950300 | 2.906700e+06 | NaN | NaN | NaN | 1.371787e+09 | 67.510267 | -66.967742 | 1.000000 |
train_new = train[['category', 'amt', 'gender', 'is_fraud']]
train_new.head()
| category | amt | gender | is_fraud | |
|---|---|---|---|---|
| 0 | grocery_pos | 281.06 | M | 1 |
| 1 | gas_transport | 11.52 | F | 1 |
| 2 | grocery_pos | 276.31 | F | 1 |
| 3 | gas_transport | 7.03 | M | 1 |
| 4 | grocery_pos | 275.73 | F | 1 |
fig = plt.figure(figsize=(35, 38))  # plt.subplots() would create a stray empty axes behind the grid
for i, val in enumerate(train_new):
    fig.add_subplot(2, 2, i + 1)
    plt.hist(train_new[val], color='#5F093D', bins=45)
    plt.xticks(rotation=90)
    plt.title(val.upper())
From the histograms above, the 'category' subplot shows that most transactions fall under grocery_pos, shopping_net, and gas_transport, and the 'gender' subplot shows that more transactions were made by females.
# Top 10 most frequent values in each of the selected columns
for i in train_new:
print(i.upper())
print(train_new[i].value_counts()[:10])
CATEGORY
grocery_pos 4808
shopping_net 4107
gas_transport 3923
shopping_pos 3755
home 3297
kids_pets 2949
misc_net 2571
entertainment 2570
personal_care 2477
food_dining 2464
Name: category, dtype: int64
AMT
1.14 24
2.31 21
2.24 19
9.37 18
1.90 18
..
7.69 8
8.98 8
4.16 8
17.09 8
10.00 8
Name: amt, Length: 682, dtype: int64
GENDER
F 21276
M 18466
Name: gender, dtype: int64
IS_FRAUD
0 32236
1 7506
Name: is_fraud, dtype: int64
category_values = train_new['category'].value_counts()
category_values = pd.DataFrame({'category_idx':category_values.index, 'category_count':category_values.values})
category_values = category_values.head(15)
plt.figure(figsize = (10,10))
sns.set_theme(style='whitegrid', palette ='Accent')
sns.barplot(data = category_values,x = 'category_idx', y = 'category_count' )
plt.ylabel('Count of Categories')
plt.xlabel('Different Categories')
plt.xticks(rotation = 45)
plt.title('Top Categories', fontsize = 16)
plt.show()
The top categories of places where transactions took place are:
1. grocery_pos
2. shopping_net
3. gas_transport
4. shopping_pos
5. home
6. kids_pets
7. misc_net
8. entertainment
9. personal_care
10. food_dining
11. health_fitness
12. misc_pos
13. grocery_net
14. travel
import plotly.express as px

# Take labels and values from the same value_counts() result so they stay aligned;
# pairing .unique() (order of appearance) with .value_counts() (descending) misaligns them.
category_counts = train_new['category'].value_counts()
palette = px.colors.sequential.Magma  # a real color sequence; the string "magma" is not a color

fig = go.Figure(data=[go.Pie(labels=category_counts.index,
                             values=category_counts.values,
                             hole=.3,
                             title='Categories',
                             pull=0.01)])
fig.update_traces(marker=dict(colors=palette))
fig.show()
From the above pie chart, we can see each category's percentage share of all transactions.
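Those percentages can also be computed directly with `value_counts(normalize=True)`; a small sketch on a toy column (in the notebook, `train_new['category']` would be passed instead):

```python
import pandas as pd

# Toy stand-in for train_new['category']; the counts are illustrative only
cat = pd.Series(['grocery_pos', 'grocery_pos', 'shopping_net', 'gas_transport'])

shares = cat.value_counts(normalize=True)  # fraction of rows per category
print((shares * 100).round(1))             # percentage share of each category
```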
train.corr()
| Unnamed: 0 | cc_num | amt | zip | lat | long | city_pop | unix_time | merch_lat | merch_long | is_fraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 1.000000 | 0.000571 | 0.537204 | -0.005588 | -0.000721 | 0.006307 | 0.008891 | 0.999270 | -0.001538 | 0.006302 | 0.806773 |
| cc_num | 0.000571 | 1.000000 | -0.000096 | 0.045618 | -0.057426 | -0.051944 | -0.007855 | 0.000535 | -0.057102 | -0.052190 | -0.005727 |
| amt | 0.537204 | -0.000096 | 1.000000 | -0.014927 | 0.012270 | 0.012950 | 0.017255 | 0.546401 | 0.011597 | 0.013178 | 0.655558 |
| zip | -0.005588 | 0.045618 | -0.014927 | 1.000000 | -0.103868 | -0.908855 | 0.088210 | -0.006312 | -0.102997 | -0.908096 | -0.010760 |
| lat | -0.000721 | -0.057426 | 0.012270 | -0.103868 | 1.000000 | -0.032817 | -0.162140 | 0.000080 | 0.993619 | -0.032813 | 0.010777 |
| long | 0.006307 | -0.051944 | 0.012950 | -0.908855 | -0.032817 | 1.000000 | -0.059813 | 0.006889 | -0.032815 | 0.999135 | 0.009203 |
| city_pop | 0.008891 | -0.007855 | 0.017255 | 0.088210 | -0.162140 | -0.059813 | 1.000000 | 0.008859 | -0.161624 | -0.059957 | 0.010640 |
| unix_time | 0.999270 | 0.000535 | 0.546401 | -0.006312 | 0.000080 | 0.006889 | 0.008859 | 1.000000 | -0.000746 | 0.006874 | 0.821524 |
| merch_lat | -0.001538 | -0.057102 | 0.011597 | -0.102997 | 0.993619 | -0.032815 | -0.161624 | -0.000746 | 1.000000 | -0.032797 | 0.009938 |
| merch_long | 0.006302 | -0.052190 | 0.013178 | -0.908096 | -0.032813 | 0.999135 | -0.059957 | 0.006874 | -0.032797 | 1.000000 | 0.009252 |
| is_fraud | 0.806773 | -0.005727 | 0.655558 | -0.010760 | 0.010777 | 0.009203 | 0.010640 | 0.821524 | 0.009938 | 0.009252 | 1.000000 |
# code
plt.figure(figsize=(25,25)) #configuring fig size
sns.heatmap(train.corr(), annot = True, cmap = "Greens")
plt.show()
From the correlation map above, we can infer that 'amt' has a strong positive correlation with 'is_fraud' (0.66). 'Unnamed: 0' and 'unix_time' also correlate strongly with 'is_fraud' (0.81 and 0.82), but this is an artifact of the undersampling: the non-fraud rows were taken from the start of the file, so low row indices and early timestamps are mostly non-fraud, and these columns should not be used as features. The location pairs (lat/merch_lat, long/merch_long) are nearly perfectly correlated with each other.
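The same reading can be extracted programmatically by ranking absolute correlations with the target; a sketch on a toy numeric frame (in the notebook, `train` itself would be used):

```python
import pandas as pd

# Toy numeric frame standing in for the numeric columns of `train`
df = pd.DataFrame({
    'amt':      [281.06, 11.52, 276.31, 7.03, 68.63, 2.57],
    'city_pop': [885, 1595797, 1595797, 885, 84861, 5895],
    'is_fraud': [1, 1, 1, 1, 0, 0],
})

# Rank features by absolute correlation with the target
corr_rank = (df.corr()['is_fraud']
               .drop('is_fraud')
               .abs()
               .sort_values(ascending=False))
print(corr_rank)
```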
X_train = train_new[['category', 'amt', 'gender']]
# X_train
cat_feats = ['category', 'gender']
X_train = pd.get_dummies(X_train, columns=cat_feats, drop_first=False)
X_train
| amt | category_entertainment | category_food_dining | category_gas_transport | category_grocery_net | category_grocery_pos | category_health_fitness | category_home | category_kids_pets | category_misc_net | category_misc_pos | category_personal_care | category_shopping_net | category_shopping_pos | category_travel | gender_F | gender_M | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 281.06 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 11.52 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 276.31 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 7.03 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 275.73 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 39737 | 68.63 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 39738 | 2.57 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 39739 | 30.71 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 39740 | 3.79 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 39741 | 3.14 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
39742 rows × 17 columns
Getting the dummy variables for the columns 'category' and 'gender'.
y_train = train_new.is_fraud
# Y_train
Importing the test set
test = pd.read_csv('/Users/nawaazsharif/Desktop/DePaul/Q5/DSC 540- Advanced Machine Learning/Project/archive-2/fraudTest.csv')
Keeping all rows where 'is_fraud' == 1, selecting a smaller subset of the rows where 'is_fraud' == 0, and then concatenating the two to reduce the class imbalance before applying the machine learning algorithms.
is_fraud = test[test['is_fraud'] == 1]
not_fraud = test[test['is_fraud'] == 0].iloc[:21751]
test = pd.concat([is_fraud, not_fraud], axis=0)
test=test.reset_index(drop = True)
test
| Unnamed: 0 | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | city | state | zip | lat | long | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1685 | 2020-06-21 22:06:39 | 3560725013359375 | fraud_Hamill-D'Amore | health_fitness | 24.84 | Brooke | Smith | F | 63542 Luna Brook Apt. 012 | Notrees | TX | 79759 | 31.8599 | -102.7413 | 23 | Cytogeneticist | 1969-09-15 | 16bf2e46c54369a8eab2214649506425 | 1371852399 | 32.575873 | -102.604290 | 1 |
| 1 | 1767 | 2020-06-21 22:32:22 | 6564459919350820 | fraud_Rodriguez, Yost and Jenkins | misc_net | 780.52 | Douglas | Willis | M | 619 Jeremy Garden Apt. 681 | Benton | WI | 53803 | 42.5545 | -90.3508 | 1306 | Public relations officer | 1958-09-10 | ab4b379d2c0c9c667d46508d4e126d72 | 1371853942 | 42.461127 | -91.147148 | 1 |
| 2 | 1781 | 2020-06-21 22:37:27 | 6564459919350820 | fraud_Nienow PLC | entertainment | 620.33 | Douglas | Willis | M | 619 Jeremy Garden Apt. 681 | Benton | WI | 53803 | 42.5545 | -90.3508 | 1306 | Public relations officer | 1958-09-10 | 47a9987ae81d99f7832a54b29a77bf4b | 1371854247 | 42.771834 | -90.158365 | 1 |
| 3 | 1784 | 2020-06-21 22:38:55 | 4005676619255478 | fraud_Heathcote, Yost and Kertzmann | shopping_net | 1077.69 | William | Perry | M | 458 Phillips Island Apt. 768 | Denham Springs | LA | 70726 | 30.4590 | -90.9027 | 71335 | Herbalist | 1994-05-31 | fe956c7e4a253c437c18918bf96f7b62 | 1371854335 | 31.204974 | -90.261595 | 1 |
| 4 | 1857 | 2020-06-21 23:02:16 | 3560725013359375 | fraud_Hermann and Sons | shopping_pos | 842.65 | Brooke | Smith | F | 63542 Luna Brook Apt. 012 | Notrees | TX | 79759 | 31.8599 | -102.7413 | 23 | Cytogeneticist | 1969-09-15 | f6838c01f5d2262006e6b71d33ba7c6d | 1371855736 | 31.315782 | -102.736390 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23891 | 21816 | 2020-06-28 22:42:01 | 3560797065840735 | fraud_Heller PLC | health_fitness | 10.24 | Janet | Turner | F | 0925 Lang Extensions | Shields | ND | 58569 | 46.1838 | -101.2589 | 77 | Film/video editor | 1989-12-17 | 15ab616c05fe891f706f5dff75e850a0 | 1372459321 | 46.735496 | -101.584722 | 0 |
| 23892 | 21817 | 2020-06-28 22:42:14 | 4904681492230012 | fraud_Wuckert, Wintheiser and Friesen | home | 37.08 | Lisa | Lowe | F | 574 David Locks Suite 207 | Cottekill | NY | 12419 | 41.8467 | -74.1038 | 722 | Comptroller | 1990-10-19 | 85c4c226db6d9cdd593e0d23a881f580 | 1372459334 | 42.209306 | -73.539428 | 0 |
| 23893 | 21818 | 2020-06-28 22:42:45 | 6528911529051375 | fraud_Jast and Sons | food_dining | 43.19 | Diane | Smith | F | 195 Murray Overpass Apt. 384 | Winter | WI | 54896 | 45.8327 | -91.0144 | 1478 | Neurosurgeon | 1965-04-27 | 27e47e580f7e3a593cd30dfb4d6783e1 | 1372459365 | 46.076510 | -90.953992 | 0 |
| 23894 | 21819 | 2020-06-28 22:42:45 | 375767678113375 | fraud_Koss, Hansen and Lueilwitz | home | 72.35 | Christopher | Patterson | M | 16744 Campbell Wall Apt. 372 | Timberville | VA | 22853 | 38.6476 | -78.7717 | 4367 | Waste management officer | 1962-04-12 | f136a0e7e1f3812ca2907c9a9cea0755 | 1372459365 | 38.667827 | -79.396782 | 0 |
| 23895 | 21820 | 2020-06-28 22:42:52 | 571465035400 | fraud_Friesen Inc | shopping_pos | 128.81 | Louis | Fisher | M | 45654 Hess Rest | Fort Washakie | WY | 82514 | 43.0048 | -108.8964 | 1645 | Freight forwarder | 1976-02-26 | 805237a77d385c189205efaeab456105 | 1372459372 | 43.162511 | -109.462904 | 0 |
23896 rows × 23 columns
test.shape
(23896, 23)
import plotly.express as px

# `is_fraud` here is the DataFrame of fraudulent test transactions from the split above.
# As before, take labels and values from the same value_counts() result so they stay aligned.
fraud_counts = is_fraud['category'].value_counts()
palette = px.colors.sequential.Magma  # a real color sequence; the string "magma" is not a color

fig = go.Figure(data=[go.Pie(labels=fraud_counts.index,
                             values=fraud_counts.values,
                             hole=.3,
                             title='Categories',
                             pull=0.01)])
fig.update_traces(marker=dict(colors=palette))
fig.show()
The pie chart above shows each category's share of the fraudulent (is_fraud == 1) transactions in the test set.
test_new = test[['category', 'amt', 'gender', 'is_fraud']]
test_new
| category | amt | gender | is_fraud | |
|---|---|---|---|---|
| 0 | health_fitness | 24.84 | F | 1 |
| 1 | misc_net | 780.52 | M | 1 |
| 2 | entertainment | 620.33 | M | 1 |
| 3 | shopping_net | 1077.69 | M | 1 |
| 4 | shopping_pos | 842.65 | F | 1 |
| ... | ... | ... | ... | ... |
| 23891 | health_fitness | 10.24 | F | 0 |
| 23892 | home | 37.08 | F | 0 |
| 23893 | food_dining | 43.19 | F | 0 |
| 23894 | home | 72.35 | M | 0 |
| 23895 | shopping_pos | 128.81 | M | 0 |
23896 rows × 4 columns
X_test = test_new[['category', 'amt', 'gender']]
# X_test
X_test = pd.get_dummies(X_test, columns=cat_feats, drop_first=False)
X_test
| amt | category_entertainment | category_food_dining | category_gas_transport | category_grocery_net | category_grocery_pos | category_health_fitness | category_home | category_kids_pets | category_misc_net | category_misc_pos | category_personal_care | category_shopping_net | category_shopping_pos | category_travel | gender_F | gender_M | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.84 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 780.52 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 620.33 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1077.69 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 842.65 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23891 | 10.24 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 23892 | 37.08 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 23893 | 43.19 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 23894 | 72.35 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 23895 | 128.81 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
23896 rows × 17 columns
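One caveat with calling `pd.get_dummies` on the train and test sets separately: if a category value is missing from one split, the dummy columns no longer match. A defensive sketch (this `reindex` step is an addition, not part of the notebook above):

```python
import pandas as pd

# Toy frames: 'travel' occurs in train but not in test
train_raw = pd.DataFrame({'amt': [10.0, 20.0], 'category': ['grocery_pos', 'travel']})
test_raw  = pd.DataFrame({'amt': [15.0],       'category': ['grocery_pos']})

X_tr = pd.get_dummies(train_raw, columns=['category'])
X_te = pd.get_dummies(test_raw, columns=['category'])

# Align test to the training columns: missing dummies become 0, extras are dropped
X_te = X_te.reindex(columns=X_tr.columns, fill_value=0)
print(list(X_te.columns))  # → ['amt', 'category_grocery_pos', 'category_travel']
```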
y_test = test_new.is_fraud
# y_test
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter = 10000, C=0.1)
# code
#model fitting
clf.fit(X_train, y_train)
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=10000,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
#calculating accuracy for training and testing set
pred = clf.predict(X_test)
#train set
train_set_pred = clf.predict(X_train)
acc_train = accuracy_score(y_train, train_set_pred)
print("Accuracy of training set is: ", acc_train)
Accuracy of training set is: 0.9385285088822908
The training-set accuracy of logistic regression is 0.938.
acc_test = accuracy_score(y_test, pred)
print("Accuracy of testing set is: ", acc_test)
Accuracy of testing set is: 0.9638433210579176
The test-set accuracy of logistic regression is 0.963.
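Since the test set is still imbalanced (21751 non-fraud vs 2145 fraud rows), plain accuracy can flatter the model; `balanced_accuracy_score` (imported above) averages recall over both classes instead. A toy sketch of the difference:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy labels: a degenerate model that predicts "not fraud" for every row
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

print(accuracy_score(y_true, y_pred))           # 0.8 — looks good
print(balanced_accuracy_score(y_true, y_pred))  # 0.5 — exposes the failure
```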
#confusion matrix
True_neg, False_pos, False_neg, True_pos = confusion_matrix(y_test, pred).ravel()
#Calculating Recall, Specificity,
#Precision, False Positive Rate
#and F1 Score
# Recall (with pos_label=0, 'not fraud' is treated as the positive class,
# which is why Recall equals Specificity in the output below)
r = recall_score(y_test, pred, pos_label=0)
print("The Recall is: ", r)
s = True_neg / (True_neg + False_pos)
print("The Specificity is: ", s)
p = precision_score(y_test, pred, pos_label = 0)
print("The Precision is: ", p)
fpr = False_pos / (False_pos + True_neg)
print("The False Positive Rate is: ", fpr)
f1 = f1_score(y_test, pred, pos_label=0)
print("The F1 Score is: ", f1)
The Recall is:  0.9885062755735369
The Specificity is:  0.9885062755735369
The Precision is:  0.9722360388876329
The False Positive Rate is:  0.011493724426463152
The F1 Score is:  0.9803036520311859
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=1)
# code
clf.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=1, p=2,
weights='uniform')
pred = clf.predict(X_test)
#Training set accuracy
train_pred = clf.predict(X_train)
train_acc = accuracy_score(y_train, train_pred)
print("Accuracy of training data using KNN is ", train_acc)
Accuracy of training data using KNN is 0.995118514417996
The training-set accuracy of the KNeighbors classifier is 0.995.
#Testing set accuracy
test_acc = accuracy_score(y_test, pred)
print("Accuracy of testing data using KNN is ", test_acc)
Accuracy of testing data using KNN is 0.9593237361901573
The test-set accuracy of the KNeighbors classifier is 0.959.
#Confusion Matrix
#confusion matrix
True_neg, False_pos, False_neg, True_pos = confusion_matrix(y_test, pred).ravel()
#Calculating Recall, Specificity,
#Precision
#and F1 Score
# Recall
r = recall_score(y_test, pred, pos_label=0)
print("The Recall is: ", r)
s = True_neg / (True_neg + False_pos)
print("The Specificity is: ", s)
p = precision_score(y_test, pred, pos_label = 0)
print("The Precision is: ", p)
f1 = f1_score(y_test, pred, pos_label = 0)
print("The F1 Score is: ", f1)
The Recall is:  0.9673118477311388
The Specificity is:  0.9673118477311388
The Precision is:  0.9877470541289142
The F1 Score is:  0.9774226516770416
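The gap between the near-perfect training accuracy and the lower test accuracy is expected with n_neighbors=1, which memorizes the training set. A hedged sketch of scanning k on held-out data (synthetic data here; in the notebook, X_train/X_test would be used):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)  # noisy labels
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Evaluate a few values of k on the held-out split
scores = {}
for k in [1, 3, 5, 9, 15]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = accuracy_score(y_val, model.predict(X_val))

best_k = max(scores, key=scores.get)
print(scores, '-> best k:', best_k)
```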
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
%%time
dec_tree = DecisionTreeClassifier()
#fitting the model
dec_tree.fit(X_train, y_train)
CPU times: user 57.7 ms, sys: 4.3 ms, total: 62 ms Wall time: 61.6 ms
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
max_depth=None, max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort='deprecated',
random_state=None, splitter='best')
%%time
pred_train = dec_tree.predict(X_train)
pred_test = dec_tree.predict(X_test)
CPU times: user 7.12 ms, sys: 3.46 ms, total: 10.6 ms Wall time: 9.16 ms
#Accuracy of train
train_acc = accuracy_score(y_train, pred_train)
print(f"The training accuracy of decision tree is: {train_acc}")
The training accuracy of decision tree is: 0.9955211111670274
The training-set accuracy of the decision tree is 0.995.
#validation score
val = accuracy_score(y_test, pred_test)
print(f"The testing accuracy of decision tree is: {val}")
The testing accuracy of decision tree is: 0.9688232340140609
The testing-set accuracy of the decision tree is 0.968.
conf_train = confusion_matrix(y_train, pred_train)
print(f"The confusion matrix of decision tree is:\n {conf_train}")
The confusion matrix of decision tree is:
 [[32217    19]
 [  159  7347]]
conf_test = confusion_matrix(y_test, pred_test)
print(f"The confusion matrix of test set of decision tree is:\n {conf_test}")
The confusion matrix of test set of decision tree is:
 [[21260   491]
 [  254  1891]]
#confusion matrix
True_neg, False_pos, False_neg, True_pos = confusion_matrix(y_test, pred_test).ravel()
#Performance metrics for validation set
#Calculating Recall, Specificity,
#Precision, Balanced accuracy,
#and F1 Score
# Recall
r = recall_score(y_test, pred_test, pos_label=0)
print("The Recall of Decision Tree is: ", r)
s = True_neg / (True_neg + False_pos)
print("The Specificity of Decision Tree is: ", s)
p = precision_score(y_test, pred_test, pos_label=0)
print("The Precision of Decision Tree is: ", p)
b = balanced_accuracy_score(y_test, pred_test)
print("The Balanced Accuracy Score of Decision Tree is: ", b)
f1 = f1_score(y_test, pred_test, pos_label=0)
print("The F1 Score of Decision Tree is: ", f1)
The Recall of Decision Tree is: 0.9774263252264264
The Specificity of Decision Tree is: 0.9774263252264264
The Precision of Decision Tree is: 0.9881937343125407
The Balanced Accuracy Score of Decision Tree is: 0.9295057034057539
The F1 Score of Decision Tree is: 0.9827805385415462
#Applying hyper-parameter tuning and cross-validation
#using min_samples_split
%%time
parameters = {'min_samples_split':[5, 10, 20, 30, 40, 50]}
dec_tree_min_samples = GridSearchCV(DecisionTreeClassifier(),
param_grid = parameters,
scoring = 'f1',
cv = 2)
CPU times: user 51 µs, sys: 140 µs, total: 191 µs Wall time: 194 µs
%%time
dec_tree_min_samples.fit(X_train, y_train)
pred_dec_tree_min_samples = dec_tree_min_samples.best_estimator_.predict(X_test)
CPU times: user 456 ms, sys: 8.15 ms, total: 464 ms Wall time: 465 ms
#Performance Metrics
recall_min_samples = recall_score(y_test, pred_dec_tree_min_samples, pos_label=0)
print(f"Recall of DT for min_samples_split is: {recall_min_samples}\n")
precision_min_samples = precision_score(y_test, pred_dec_tree_min_samples, pos_label=0)
print(f"Precision of DT for min_samples_split is: {precision_min_samples}\n")
balanced_accuracy_min_samples = balanced_accuracy_score(y_test, pred_dec_tree_min_samples)
print(f"Balanced Accuracy of DT for min_samples_split is: {balanced_accuracy_min_samples}\n")
f1_min_samples = f1_score(y_test, pred_dec_tree_min_samples, pos_label=0)
print(f"F1 score of DT for min_samples_split is: {f1_min_samples}")
Recall of DT for min_samples_split is: 0.9837248862121282
Precision of DT for min_samples_split is: 0.988953595858754
Balanced Accuracy of DT for min_samples_split is: 0.9361514873951085
F1 score of DT for min_samples_split is: 0.9863323115218845
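After fitting, the grid-search object also exposes which candidate won via `best_params_` and its cross-validated score via `best_score_`. A self-contained sketch on synthetic data (`make_classification` stands in for the fraud features; the variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={'min_samples_split': [5, 10, 20]},
                    scoring='f1', cv=2)
grid.fit(X, y)

best_split = grid.best_params_['min_samples_split']  # winning candidate
best_cv_f1 = grid.best_score_                        # its mean CV F1 score
```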
#Applying hyper-parameter tuning and cross-validation
#using max_depth
%%time
parameters = {'max_depth':[1,2,3,None]}
dec_tree_max_depth = GridSearchCV(DecisionTreeClassifier(),
param_grid = parameters,
scoring = 'f1',
cv = 2)
CPU times: user 20 µs, sys: 2 µs, total: 22 µs Wall time: 23.6 µs
%%time
dec_tree_max_depth.fit(X_train, y_train)
pred_dec_tree_max_depth = dec_tree_max_depth.best_estimator_.predict(X_test)
CPU times: user 220 ms, sys: 4 ms, total: 224 ms Wall time: 225 ms
#Performance Metrics
recall_max_depth = recall_score(y_test, pred_dec_tree_max_depth, pos_label=0)
print(f"Recall of DT for max_depth is: {recall_max_depth}\n")
precision_max_depth = precision_score(y_test, pred_dec_tree_max_depth, pos_label=0)
print(f"Precision of DT for max_depth is: {precision_max_depth}\n")
balanced_accuracy_max_depth = balanced_accuracy_score(y_test, pred_dec_tree_max_depth)
print(f"Balanced Accuracy of DT for max_depth is: {balanced_accuracy_max_depth}\n")
f1_max_depth = f1_score(y_test, pred_dec_tree_max_depth, pos_label=0)
print(f"F1 score of DT for max_depth is: {f1_max_depth}")
Recall of DT for max_depth is: 0.9774723001241322
Precision of DT for max_depth is: 0.9881024306362411
Balanced Accuracy of DT for max_depth is: 0.9290624903884064
F1 score of DT for max_depth is: 0.9827586206896552
#Applying hyper-parameter tuning and cross-validation
#using min_samples_split and
#using max_depth
%%time
parameters = {'min_samples_split':[5, 10, 20, 30, 40, 50], 'max_depth':[1,2,3,None]}
dec_tree_min_max = GridSearchCV(DecisionTreeClassifier(),
param_grid = parameters,
scoring = 'f1',
cv = 2)
CPU times: user 17 µs, sys: 0 ns, total: 17 µs Wall time: 20.3 µs
%%time
dec_tree_min_max.fit(X_train, y_train)
pred_dec_tree_min_max = dec_tree_min_max.best_estimator_.predict(X_test)
CPU times: user 943 ms, sys: 6.66 ms, total: 950 ms Wall time: 964 ms
#Performance Metrics
recall_min_max = recall_score(y_test, pred_dec_tree_min_max, pos_label=0)
print(f"Recall of DT for min_max is: {recall_min_max}\n")
precision_min_max = precision_score(y_test, pred_dec_tree_min_max, pos_label=0)
print(f"Precision of DT for min_max is: {precision_min_max}\n")
balanced_accuracy_min_max = balanced_accuracy_score(y_test, pred_dec_tree_min_max)
print(f"Balanced Accuracy of DT for min_max is: {balanced_accuracy_min_max}\n")
f1_min_max = f1_score(y_test, pred_dec_tree_min_max, pos_label=0)
print(f"F1 score of DT for min_max is: {f1_min_max}")
Recall of DT for min_max is: 0.9837248862121282
Precision of DT for min_max is: 0.988953595858754
Balanced Accuracy of DT for min_max is: 0.9361514873951085
F1 score of DT for min_max is: 0.9863323115218845
from sklearn.ensemble import RandomForestClassifier
%%time
random_for = RandomForestClassifier()
#fitting the model
random_for.fit(X_train, y_train)
CPU times: user 1.48 s, sys: 10.9 ms, total: 1.49 s Wall time: 1.49 s
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
%%time
pred_train_rf = random_for.predict(X_train)
pred_test_rf = random_for.predict(X_test)
CPU times: user 458 ms, sys: 3.66 ms, total: 461 ms Wall time: 461 ms
#Accuracy of train
train_acc_rf = accuracy_score(y_train, pred_train_rf)
print(f"The training accuracy of Random Forest is: {train_acc_rf}")
The training accuracy of Random Forest is: 0.9955211111670274
The training-set accuracy of the random forest is 0.995.
#validation score
test_rf = accuracy_score(y_test, pred_test_rf)
print(f"The validation accuracy of Random Forest is: {test_rf}")
The validation accuracy of Random Forest is: 0.9678607298292601
The testing-set accuracy of the random forest is 0.968.
conf_train_rf = confusion_matrix(y_train, pred_train_rf)
print(f"The confusion matrix of Random Forest is:\n {conf_train_rf}")
The confusion matrix of Random Forest is:
 [[32168    68]
 [  110  7396]]
conf_test_rf = confusion_matrix(y_test, pred_test_rf)
print(f"The confusion matrix of validation of Random Forest is:\n {conf_test_rf}")
The confusion matrix of validation of Random Forest is:
 [[21232   519]
 [  249  1896]]
#confusion matrix
True_neg_rf, False_pos_rf, False_neg_rf, True_pos_rf = confusion_matrix(y_test, pred_test_rf).ravel()
#Performance metrics for validation set
#Calculating Recall, Specificity,
#Precision, Balanced accuracy,
#and F1 Score
# Recall
r = recall_score(y_test, pred_test_rf, pos_label=0)
print("The Recall of Random Forest is: ", r)
s = True_neg_rf / (True_neg_rf + False_pos_rf)
print("The Specificity of Random Forest is: ", s)
p = precision_score(y_test, pred_test_rf, pos_label=0)
print("The Precision of Random Forest is: ", p)
b = balanced_accuracy_score(y_test, pred_test_rf)
print("The Balanced Accuracy Score of Random Forest is: ", b)
f1 = f1_score(y_test, pred_test_rf, pos_label=0)
print("The F1 Score of Random Forest is: ", f1)
The Recall of Random Forest is: 0.9761390280906626
The Specificity of Random Forest is: 0.9761390280906626
The Precision of Random Forest is: 0.9884083608770542
The Balanced Accuracy Score of Random Forest is: 0.9300275560033733
The F1 Score of Random Forest is: 0.9822353811991118
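Beyond the headline metrics, a fitted random forest also reports how much each column contributed to its splits through `feature_importances_`, which can help explain which transaction attributes drive the fraud predictions. A minimal sketch on synthetic data (the variable names are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

importances = rf.feature_importances_   # one weight per column, summing to 1
ranked = np.argsort(importances)[::-1]  # column indices, most important first
```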
#Applying hyper-parameter tuning and cross-validation
#using min_samples_split and
#using max_depth and
#using n_estimators
%%time
#note: sklearn requires min_samples_split >= 2, so the candidate 1 fails and is scored as NaN during the search
parameters = {'min_samples_split':[1,2,3],
'max_depth':[1,2,3],
'n_estimators':[200,400]}
rf_tuning = GridSearchCV(RandomForestClassifier(),
param_grid = parameters,
scoring = 'f1',
cv = 2)
CPU times: user 35 µs, sys: 12 µs, total: 47 µs Wall time: 50.1 µs
%%time
rf_tuning.fit(X_train, y_train)
rf_mins_maxd_nest = rf_tuning.best_estimator_.predict(X_test)
CPU times: user 29.8 s, sys: 186 ms, total: 30 s Wall time: 30.1 s
#Performance Metrics
recall_rf_tuning = recall_score(y_test, rf_mins_maxd_nest, pos_label=0)
print(f"Recall of RF is: {recall_rf_tuning}\n")
precision_rf_tuning = precision_score(y_test, rf_mins_maxd_nest, pos_label=0)
print(f"Precision of RF is: {precision_rf_tuning}\n")
balanced_rf_tuning = balanced_accuracy_score(y_test, rf_mins_maxd_nest)
print(f"Balanced Accuracy of RF is: {balanced_rf_tuning}\n")
f1_rf_tuning = f1_score(y_test, rf_mins_maxd_nest, pos_label=0)
print(f"F1 score of RF is: {f1_rf_tuning}")
Recall of RF is: 0.9938393637074158
Precision of RF is: 0.9725996580581301
Balanced Accuracy of RF is: 0.8549616398956659
F1 score of RF is: 0.9831048047843192
from sklearn.ensemble import AdaBoostClassifier
%%time
ada_boost = AdaBoostClassifier()
#fitting the model
ada_boost.fit(X_train, y_train)
CPU times: user 596 ms, sys: 8.83 ms, total: 605 ms Wall time: 612 ms
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
n_estimators=50, random_state=None)
%%time
pred_train_ada = ada_boost.predict(X_train)
pred_test_ada = ada_boost.predict(X_test)
CPU times: user 204 ms, sys: 3.28 ms, total: 207 ms Wall time: 207 ms
#Accuracy of train
train_acc_ada = accuracy_score(y_train, pred_train_ada)
print(f"The training accuracy of AdaBoost is: {train_acc_ada}")
The training accuracy of AdaBoost is: 0.9498767047456091
The training-set accuracy of AdaBoost is 0.949.
#validation score
test_ada = accuracy_score(y_test, pred_test_ada)
print(f"The validation accuracy of AdaBoost is: {test_ada}")
The validation accuracy of AdaBoost is: 0.9673167057248075
The testing-set accuracy of AdaBoost is 0.967.
conf_train_ada = confusion_matrix(y_train, pred_train_ada)
print(f"The confusion matrix of AdaBoost is:\n {conf_train_ada}")
The confusion matrix of AdaBoost is:
 [[31801   435]
 [ 1557  5949]]
conf_test_ada = confusion_matrix(y_test, pred_test_ada)
print(f"The confusion matrix of validation of AdaBoost is:\n {conf_test_ada}")
The confusion matrix of validation of AdaBoost is:
 [[21463   288]
 [  493  1652]]
#confusion matrix
True_neg_ada, False_pos_ada, False_neg_ada, True_pos_ada = confusion_matrix(y_test, pred_test_ada).ravel()
#Performance metrics for validation set
#Calculating Recall, Specificity,
#Precision, Balanced accuracy,
#and F1 Score
# Recall
r = recall_score(y_test, pred_test_ada, pos_label=0)
print("The Recall of AdaBoost is: ", r)
s = True_neg_ada / (True_neg_ada + False_pos_ada)
print("The Specificity of AdaBoost is: ", s)
p = precision_score(y_test, pred_test_ada, pos_label=0)
print("The Precision of AdaBoost is: ", p)
b = balanced_accuracy_score(y_test, pred_test_ada)
print("The Balanced Accuracy Score of AdaBoost is: ", b)
f1 = f1_score(y_test, pred_test_ada, pos_label=0)
print("The F1 Score of AdaBoost is: ", f1)
The Recall of AdaBoost is: 0.9867592294607145
The Specificity of AdaBoost is: 0.9867592294607145
The Precision of AdaBoost is: 0.9775460010930953
The Balanced Accuracy Score of AdaBoost is: 0.8784611998119423
The F1 Score of AdaBoost is: 0.9821310087628985
#Applying hyper-parameter tuning and cross-validation
#using learning_rate and
#using n_estimators
%%time
parameters = {'learning_rate':[0.01, 0.1, 1, 10, 100],
'n_estimators':[5, 50, 250, 500]}
ada_tuning = GridSearchCV(AdaBoostClassifier(),
param_grid = parameters,
scoring = 'f1',
cv = 2)
CPU times: user 23 µs, sys: 21 µs, total: 44 µs Wall time: 48.2 µs
%%time
ada_tuning.fit(X_train, y_train)
ada = ada_tuning.best_estimator_.predict(X_test)
CPU times: user 43.4 s, sys: 297 ms, total: 43.7 s Wall time: 44 s
#Performance Metrics
recall_ada_tuning = recall_score(y_test, ada, pos_label=0)
print(f"Recall of AdaBoost is: {recall_ada_tuning}\n")
precision_ada_tuning = precision_score(y_test, ada, pos_label=0)
print(f"Precision of AdaBoost is: {precision_ada_tuning}\n")
balanced_ada_tuning = balanced_accuracy_score(y_test, ada)
print(f"Balanced Accuracy of AdaBoost is: {balanced_ada_tuning}\n")
f1_ada_tuning = f1_score(y_test, ada, pos_label=0)
print(f"F1 score of AdaBoost is: {f1_ada_tuning}")
Recall of AdaBoost is: 0.9867592294607145
Precision of AdaBoost is: 0.9775460010930953
Balanced Accuracy of AdaBoost is: 0.8784611998119423
F1 score of AdaBoost is: 0.9821310087628985
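AdaBoost also exposes `staged_predict`, which replays the ensemble one boosting round at a time; this is a cheap way to judge how many estimators the grid over `n_estimators` actually needs to consider. A self-contained sketch on synthetic data (the variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=400, random_state=0)
ada = AdaBoostClassifier(n_estimators=20, random_state=0).fit(X, y)

# staged_predict yields predictions after 1, 2, ... boosting rounds,
# showing how training accuracy evolves as estimators are added
accs = [accuracy_score(y, pred) for pred in ada.staged_predict(X)]
```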
from sklearn.ensemble import GradientBoostingClassifier
%%time
grad_boost = GradientBoostingClassifier()
#fitting the model
grad_boost.fit(X_train, y_train)
CPU times: user 1.94 s, sys: 11.4 ms, total: 1.95 s Wall time: 1.96 s
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_iter_no_change=None, presort='deprecated',
random_state=None, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0,
warm_start=False)
%%time
pred_train_grad = grad_boost.predict(X_train)
pred_test_grad = grad_boost.predict(X_test)
CPU times: user 58.7 ms, sys: 6.42 ms, total: 65.1 ms Wall time: 72.7 ms
#Accuracy of train
train_acc_grad = accuracy_score(y_train, pred_train_grad)
print(f"The training accuracy of Gradient Boosting Machine is: {train_acc_grad}")
The training accuracy of Gradient Boosting Machine is: 0.9675657994061698
The training-set accuracy of the gradient boosting machine is 0.967.
#validation score
test_grad = accuracy_score(y_test, pred_test_grad)
print(f"The validation accuracy of Gradient Boosting Machine is: {test_grad}")
The validation accuracy of Gradient Boosting Machine is: 0.978657515902243
The testing-set accuracy of the gradient boosting machine is 0.978.
conf_train_grad = confusion_matrix(y_train, pred_train_grad)
print(f"The confusion matrix of Gradient Boosting Machine is:\n {conf_train_grad}")
The confusion matrix of Gradient Boosting Machine is:
 [[32045   191]
 [ 1098  6408]]
conf_test_grad = confusion_matrix(y_test, pred_test_grad)
print(f"The confusion matrix of validation of Gradient Boosting Machine is:\n {conf_test_grad}")
The confusion matrix of validation of Gradient Boosting Machine is:
 [[21610   141]
 [  369  1776]]
#confusion matrix
True_neg_grad, False_pos_grad, False_neg_grad, True_pos_grad = confusion_matrix(y_test, pred_test_grad).ravel()
#Performance metrics for validation set
#Calculating Recall, Specificity,
#Precision, Balanced accuracy,
#and F1 Score
# Recall
r = recall_score(y_test, pred_test_grad, pos_label=0)
print("The Recall of Gradient Boosting Machine is: ", r)
s = True_neg_grad / (True_neg_grad + False_pos_grad)
print("The Specificity of Gradient Boosting Machine is: ", s)
p = precision_score(y_test, pred_test_grad, pos_label=0)
print("The Precision of Gradient Boosting Machine is: ", p)
b = balanced_accuracy_score(y_test, pred_test_grad)
print("The Balanced Accuracy Score of Gradient Boosting Machine is: ", b)
f1 = f1_score(y_test, pred_test_grad, pos_label=0)
print("The F1 Score of Gradient Boosting Machine is: ", f1)
The Recall of Gradient Boosting Machine is: 0.9935175394234748
The Specificity of Gradient Boosting Machine is: 0.9935175394234748
The Precision of Gradient Boosting Machine is: 0.983211247099504
The Balanced Accuracy Score of Gradient Boosting Machine is: 0.9107447836977514
The F1 Score of Gradient Boosting Machine is: 0.9883375257260462
#Applying hyper-parameter tuning and cross-validation
#using learning_rate and
#using n_estimators
%%time
parameters = {'learning_rate':[0.01, 0.1, 1, 10, 100],
'n_estimators':[5, 50, 250, 500]}
grad_tuning = GridSearchCV(GradientBoostingClassifier(),
param_grid = parameters,
scoring = 'f1',
cv = 2)
CPU times: user 27 µs, sys: 27 µs, total: 54 µs Wall time: 57 µs
%%time
grad_tuning.fit(X_train, y_train)
grad = grad_tuning.best_estimator_.predict(X_test)
CPU times: user 1min 32s, sys: 182 ms, total: 1min 32s Wall time: 1min 32s
#Performance Metrics
recall_grad_tuning = recall_score(y_test, grad, pos_label=0)
print(f"Recall of Gradient Boosting Machine is: {recall_grad_tuning}\n")
precision_grad_tuning = precision_score(y_test, grad, pos_label=0)
print(f"Precision of Gradient Boosting Machine is: {precision_grad_tuning}\n")
balanced_grad_tuning = balanced_accuracy_score(y_test, grad)
print(f"Balanced Accuracy of Gradient Boosting Machine is: {balanced_grad_tuning}\n")
f1_grad_tuning = f1_score(y_test, grad, pos_label=0)
print(f"F1 score of Gradient Boosting Machine is: {f1_grad_tuning}")
Recall of Gradient Boosting Machine is: 0.9867592294607145
Precision of Gradient Boosting Machine is: 0.9775460010930953
Balanced Accuracy of Gradient Boosting Machine is: 0.8784611998119423
F1 score of Gradient Boosting Machine is: 0.9821310087628985
from sklearn.linear_model import LinearRegression
linear = LinearRegression()
linear.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
linear_train = linear.predict(X_train)
linear_test = linear.predict(X_test)
r2_linear_train = metrics.r2_score(y_train, linear_train)
print("The R2 value of linear regression for train set is: ", r2_linear_train)
The R2 value of linear regression for train set is: 0.4566998525123023
r2_linear_test = metrics.r2_score(y_test, linear_test)
print("The R2 value of linear regression for test set is: ", r2_linear_test)
The R2 value of linear regression for test set is: 0.321399849781256
RMSE_train = sqrt(metrics.mean_squared_error(y_train, linear_train))
print("The RMSE value of linear regression for train set is: ", RMSE_train)
The RMSE value of linear regression for train set is: 0.28849948752600724
RMSE_test = sqrt(metrics.mean_squared_error(y_test, linear_test))
print("The RMSE value of linear regression for test set is: ", RMSE_test)
The RMSE value of linear regression for test set is: 0.23546969896661887
The training and testing RMSEs are close (0.288 vs 0.235), which suggests the linear regression model generalizes without overfitting.
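The R²/RMSE pair is recomputed for every regressor below, so a small helper (hypothetical; `regression_report` is not part of the notebook) makes those comparisons uniform. Taking `sqrt` of `mean_squared_error` keeps the code compatible with older scikit-learn releases that do not accept a `squared=False` argument:

```python
from math import sqrt
from sklearn.metrics import mean_squared_error, r2_score

def regression_report(y_true, y_pred):
    """Return (R^2, RMSE) for a set of regression predictions."""
    return r2_score(y_true, y_pred), sqrt(mean_squared_error(y_true, y_pred))

# tiny worked example with hand-checkable values
r2, rmse = regression_report([0, 1, 1, 0], [0.1, 0.9, 0.8, 0.2])
```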
from sklearn.ensemble import RandomForestRegressor
RANDOM_STATE = 0
estimator = RandomForestRegressor(n_estimators=100, random_state=RANDOM_STATE)
estimator.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
max_depth=None, max_features='auto', max_leaf_nodes=None,
max_samples=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=None, oob_score=False,
random_state=0, verbose=0, warm_start=False)
rf_train = estimator.predict(X_train)
rf_test = estimator.predict(X_test)
r2_rf_train = metrics.r2_score(y_train, rf_train)
print("The R2 value of random forest regression for train set is: ", r2_rf_train)
The R2 value of random forest regression for train set is: 0.9654276594245501
r2_rf_test = metrics.r2_score(y_test, rf_test)
print("The R2 value of random forest regression for test set is: ", r2_rf_test)
The R2 value of random forest regression for test set is: 0.7255034623493798
RMSE_train_rf = sqrt(metrics.mean_squared_error(y_train, rf_train))
print("The RMSE value of random forest regression for train set is: ", RMSE_train_rf)
The RMSE value of random forest regression for train set is: 0.07277622524881817
RMSE_test_rf = sqrt(metrics.mean_squared_error(y_test, rf_test))
print("The RMSE value of random forest regression for test set is: ", RMSE_test_rf)
The RMSE value of random forest regression for test set is: 0.14976022652269988
The training RMSE (0.073) is roughly half the testing RMSE (0.150), which indicates the random forest is overfitting the training data to some degree.
Changing some hyperparameters of the Random Forest model.
Adding max_depth parameter
estimator_new = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=RANDOM_STATE)
estimator_new.fit(X_train, y_train)
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
max_depth=5, max_features='auto', max_leaf_nodes=None,
max_samples=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=None, oob_score=False,
random_state=0, verbose=0, warm_start=False)
rf_train_max = estimator_new.predict(X_train)
rf_test_max = estimator_new.predict(X_test)
r2_rf_train_max = metrics.r2_score(y_train, rf_train_max)
print("The R2 value of random forest regression for train set is: ", r2_rf_train_max)
The R2 value of random forest regression for train set is: 0.8019695201088521
r2_rf_test_max = metrics.r2_score(y_test, rf_test_max)
print("The R2 value of random forest regression for test set is: ", r2_rf_test_max)
The R2 value of random forest regression for test set is: 0.7293726308891015
RMSE_train_rf_max = sqrt(metrics.mean_squared_error(y_train, rf_train_max))
print("The RMSE value of random forest regression for train set is: ", RMSE_train_rf_max)
The RMSE value of random forest regression for train set is: 0.1741771397572827
RMSE_test_rf_max = sqrt(metrics.mean_squared_error(y_test, rf_test_max))
print("The RMSE value of random forest regression for test set is: ", RMSE_test_rf_max)
The RMSE value of random forest regression for test set is: 0.14870100737382277
With max_depth=5 the training and testing RMSEs are close (0.174 vs 0.149), so constraining the tree depth has removed the earlier overfitting.
from sklearn.svm import SVR
%%time
estimator = SVR(kernel='linear')
estimator.fit(X_train, y_train)
CPU times: user 2h 31s, sys: 27.9 s, total: 2h 59s Wall time: 2h 2min 11s
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
kernel='linear', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
%%time
SVM_linear_train = estimator.predict(X_train)
SVM_linear_test = estimator.predict(X_test)
CPU times: user 20.2 s, sys: 76.2 ms, total: 20.3 s Wall time: 20.4 s
r2_SVM_linear_train = metrics.r2_score(y_train, SVM_linear_train)
print("The R2 value of SVR regression for train set is: ", r2_SVM_linear_train)
The R2 value of SVR regression for train set is: 0.36829308373027503
r2_SVM_linear_test = metrics.r2_score(y_test, SVM_linear_test)
print("The R2 value of SVR regression for test set is: ", r2_SVM_linear_test)
The R2 value of SVR regression for test set is: 0.17347766769229633
RMSE_train_SVM_linear = sqrt(metrics.mean_squared_error(y_train, SVM_linear_train))
print("The RMSE value of SVR regression for train set is: ", RMSE_train_SVM_linear)
The RMSE value of SVR regression for train set is: 0.3110877791259909
RMSE_test_SVM_linear = sqrt(metrics.mean_squared_error(y_test, SVM_linear_test))
print("The RMSE value of SVR regression for test set is: ", RMSE_test_SVM_linear)
The RMSE value of SVR regression for test set is: 0.25986952291915366
The training and testing RMSEs are close (0.311 vs 0.260), which suggests the linear-kernel SVR generalizes without overfitting.
%%time
from sklearn.svm import SVR
estimator = SVR(kernel='poly')
estimator.fit(X_train, y_train)
CPU times: user 13min 9s, sys: 2.73 s, total: 13min 12s Wall time: 13min 16s
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
kernel='poly', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
%%time
SVM_poly_train = estimator.predict(X_train)
SVM_poly_test = estimator.predict(X_test)
CPU times: user 7.61 s, sys: 37.5 ms, total: 7.65 s Wall time: 7.71 s
r2_SVM_poly_train = metrics.r2_score(y_train, SVM_poly_train)
print("The R2 value of Poly SVR regression for train set is: ", r2_SVM_poly_train)
The R2 value of Poly SVR regression for train set is: -0.47731489592279663
r2_SVM_poly_test = metrics.r2_score(y_test, SVM_poly_test)
print("The R2 value of Poly SVR regression for test set is: ", r2_SVM_poly_test)
The R2 value of Poly SVR regression for test set is: -0.4518084111005336
RMSE_train_SVM_poly = sqrt(metrics.mean_squared_error(y_train, SVM_poly_train))
print("The RMSE value of Poly SVR regression for train set is: ", RMSE_train_SVM_poly)
The RMSE value of Poly SVR regression for train set is: 0.47573124186176774
RMSE_test_SVM_poly = sqrt(metrics.mean_squared_error(y_test, SVM_poly_test))
print("The RMSE value of Poly SVR regression for test set is: ", RMSE_test_SVM_poly)
The RMSE value of Poly SVR regression for test set is: 0.34441551493534034
The training and testing RMSEs are close, so the model is not overfitting, but the negative R² values on both sets mean the polynomial-kernel SVR predicts worse than simply predicting the mean of the target. Even though a polynomial SVM can be applied to both regression and classification tasks, it clearly did not fit this dataset well.
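A negative R² is not an error: by definition R² is zero for a model that always predicts the mean of the target, and it drops below zero when predictions are systematically worse than that baseline. A tiny worked check:

```python
from sklearn.metrics import r2_score

y_true = [0, 0, 1, 1]

# always predicting the mean of y_true gives R^2 = 0 by definition
baseline = r2_score(y_true, [0.5, 0.5, 0.5, 0.5])

# predictions systematically worse than the mean score below zero
worse = r2_score(y_true, [1, 1, 0, 0])
```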
%%time
from sklearn.svm import SVR
estimator = SVR(kernel='rbf')
estimator.fit(X_train, y_train)
CPU times: user 13.4 s, sys: 196 ms, total: 13.6 s Wall time: 13.8 s
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
%%time
SVM_rbf_train = estimator.predict(X_train)
SVM_rbf_test = estimator.predict(X_test)
CPU times: user 7.2 s, sys: 28.5 ms, total: 7.23 s Wall time: 7.23 s
r2_SVM_rbf_train = metrics.r2_score(y_train, SVM_rbf_train)
print("The R2 value of RBF SVR regression for train set is: ", r2_SVM_rbf_train)
The R2 value of RBF SVR regression for train set is: 0.6046403742940157
r2_SVM_rbf_test = metrics.r2_score(y_test, SVM_rbf_test)
print("The R2 value of RBF SVR regression for test set is: ", r2_SVM_rbf_test)
The R2 value of RBF SVR regression for test set is: 0.4952244323911741
RMSE_train_SVM_rbf = sqrt(metrics.mean_squared_error(y_train, SVM_rbf_train))
print("The RMSE value of RBF SVR regression for train set is: ", RMSE_train_SVM_rbf)
The RMSE value of RBF SVR regression for train set is: 0.24610548500488857
RMSE_test_SVM_rbf = sqrt(metrics.mean_squared_error(y_test, SVM_rbf_test))
print("The RMSE value of RBF SVR regression for test set is: ", RMSE_test_SVM_rbf)
The RMSE value of RBF SVR regression for test set is: 0.20308470468498374
The training and testing RMSEs are close (0.246 vs 0.203), which suggests the RBF-kernel SVR generalizes without overfitting.
from sklearn.decomposition import PCA
%%time
pca = PCA(n_components=None)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)
CPU times: user 148 ms, sys: 136 ms, total: 283 ms Wall time: 129 ms
print(pca.explained_variance_ratio_.cumsum())
plt.plot(pca.explained_variance_ratio_.cumsum(), '-o');
plt.xticks(ticks= range(X_train_pca.shape[1]), labels=[i+1 for i in range(X_train_pca.shape[1])])
plt.xlabel('Principal Components')
plt.ylabel('Variance Explained')
plt.show()
[0.9999818 0.99998828 0.9999898 0.99999106 0.99999229 0.99999337 0.99999438 0.9999953 0.99999614 0.99999695 0.99999773 0.99999846 0.99999919 0.99999963 1. 1. 1. ]
From the cumulative variance output, the first principal component alone explains more than 99.99% of the variance; the features are on very different scales, so one high-variance column dominates the decomposition. We nevertheless keep the first 5 PCs for the evaluation below.
X_train_pca2 = X_train_pca[:, 0:5]
X_test_pca2 = X_test_pca[:, 0:5]
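The lopsided variance ratios above are typical of PCA on unscaled features, and scikit-learn's `PCA` can also choose the number of components for a target variance share when `n_components` is given as a fraction. A self-contained sketch on synthetic data illustrating both points (the column scales are an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# six independent columns, one of them on a vastly larger scale
X = rng.normal(size=(200, 6)) * np.array([1e6, 1, 1, 1, 1, 1])

# without scaling, the large column absorbs almost all the variance
raw_ratio = PCA().fit(X).explained_variance_ratio_[0]

# after standardizing, PCA(n_components=0.90) keeps just enough
# components to explain 90% of the (now evenly spread) variance
Xs = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.90).fit(Xs)
```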
Linear Regression with PCA transformed data
pca_linear = LinearRegression()
pca_linear.fit(X_train_pca2, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
pca_linear_train = pca_linear.predict(X_train_pca2)
pca_linear_test = pca_linear.predict(X_test_pca2)
r2_pca_linear_train = metrics.r2_score(y_train, pca_linear_train)
print("The R2 value of Linear Regression with PCA for train set is: ", r2_pca_linear_train)
The R2 value of Linear Regression with PCA for train set is: 0.45535214265750357
r2_pca_linear_test = metrics.r2_score(y_test, pca_linear_test)
print("The R2 value of Linear Regression with PCA for test set is: ", r2_pca_linear_test)
The R2 value of Linear Regression with PCA for test set is: 0.3221005326816868
RMSE_pca_linear_train = sqrt(metrics.mean_squared_error(y_train, pca_linear_train))
print("The RMSE value of Linear Regression with PCA for train set is: ", RMSE_pca_linear_train)
The RMSE value of Linear Regression with PCA for train set is: 0.2888570916792306
RMSE_pca_linear_test = sqrt(metrics.mean_squared_error(y_test, pca_linear_test))
print("The RMSE value of Linear Regression with PCA for test set is: ", RMSE_pca_linear_test)
The RMSE value of Linear Regression with PCA for test set is: 0.23534810143732313
The training and testing RMSEs are close (0.289 vs 0.235), which suggests the PCA-transformed linear regression generalizes without overfitting.
RF Regression with PCA transformed data
pca_randomFor = RandomForestRegressor()
pca_randomFor.fit(X_train_pca2, y_train)
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
max_depth=None, max_features='auto', max_leaf_nodes=None,
max_samples=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=None, oob_score=False,
random_state=None, verbose=0, warm_start=False)
pca_rf_train = pca_randomFor.predict(X_train_pca2)
pca_rf_test = pca_randomFor.predict(X_test_pca2)
r2_pca_rf_train = metrics.r2_score(y_train, pca_rf_train)
print("The R2 value of RF with PCA for train set is: ", r2_pca_rf_train)
The R2 value of RF with PCA for train set is: 0.9647350869240827
r2_pca_rf_test = metrics.r2_score(y_test, pca_rf_test)
print("The R2 value of RF with PCA for test set is: ", r2_pca_rf_test)
The R2 value of RF with PCA for test set is: 0.7151117293916662
RMSE_pca_rf_train = sqrt(metrics.mean_squared_error(y_train, pca_rf_train))
print("The RMSE value of RF with PCA for train set is: ", RMSE_pca_rf_train)
The RMSE value of RF with PCA for train set is: 0.07350155775812328
RMSE_pca_rf_test = sqrt(metrics.mean_squared_error(y_test, pca_rf_test))
print("The RMSE value of RF with PCA for test set is: ", RMSE_pca_rf_test)
The RMSE value of RF with PCA for test set is: 0.15256866190776933
The testing RMSE (0.153) is about twice the training RMSE (0.074), so the model is overfitting somewhat, although the absolute gap between the two RMSEs is small.
SVM Regression with PCA-transformed data
pca_SVM_reg = SVR()
pca_SVM_reg.fit(X_train_pca2, y_train)
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='scale',
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
pca_SVM_reg_train = pca_SVM_reg.predict(X_train_pca2)
pca_SVM_reg_test = pca_SVM_reg.predict(X_test_pca2)
r2_pca_SVM_reg_train = metrics.r2_score(y_train, pca_SVM_reg_train)
print("The R2 value of SVM with PCA for train set is: ", r2_pca_SVM_reg_train)
The R2 value of SVM with PCA for train set is: 0.6097142687392625
r2_pca_SVM_reg_test = metrics.r2_score(y_test, pca_SVM_reg_test)
print("The R2 value of SVM with PCA for test set is: ", r2_pca_SVM_reg_test)
The R2 value of SVM with PCA for test set is: 0.510606397927333
RMSE_pca_SVM_reg_train = sqrt(metrics.mean_squared_error(y_train, pca_SVM_reg_train))
print("The RMSE value of SVM with PCA for train set is: ", RMSE_pca_SVM_reg_train)
The RMSE value of SVM with PCA for train set is: 0.24452117357632838
RMSE_pca_SVM_reg_test = sqrt(metrics.mean_squared_error(y_test, pca_SVM_reg_test))
print("The RMSE value of SVM with PCA for test set is: ", RMSE_pca_SVM_reg_test)
The RMSE value of SVM with PCA for test set is: 0.19996647759454383
Here the train RMSE (0.245) is actually higher than the test RMSE (0.200), so the model is not overfitting; the relatively low R² on both sets (0.61 train, 0.51 test) instead suggests the model is underfitting.
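A common remedy for an underfitting SVR is to widen the search over `C` and `gamma` with GridSearchCV. The grid below is an illustrative sketch on synthetic data, not values tuned for this notebook:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Larger C reduces regularization; gamma controls the RBF kernel width.
grid = GridSearchCV(SVR(kernel='rbf'),
                    param_grid={'C': [0.1, 1, 10],
                                'gamma': ['scale', 0.1, 1.0]},
                    scoring='neg_root_mean_squared_error',
                    cv=3)
grid.fit(X_tr, y_tr)
best_rmse = -grid.best_score_  # scorer is negated, so flip the sign back
```

If the defaults underfit, the search typically selects a larger `C` and/or a different `gamma` than `SVR()`'s defaults.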
classf_models = [
{
'classification_label': 'Logistic Regression',
'classification_model': LogisticRegression(max_iter = 10000, C=0.1),
},
{
'classification_label': 'KNN',
'classification_model': KNeighborsClassifier(n_neighbors=1),
},
{
'classification_label': 'Decision Tree',
'classification_model':GridSearchCV(DecisionTreeClassifier(),
param_grid = {'min_samples_split':[5, 10, 20, 30, 40, 50]},
scoring = 'f1',
cv = 2),
},
{
'classification_label': 'Random Forest',
'classification_model': GridSearchCV(RandomForestClassifier(),
param_grid = {'min_samples_split':[1,2,3],
'max_depth':[1,2,3],
'n_estimators':[200,400]},
scoring = 'f1',
cv = 2),
},
{
'classification_label': 'AdaBoost Model',
'classification_model': GridSearchCV(AdaBoostClassifier(),
param_grid = {'learning_rate':[0.01, 0.1, 1, 10, 100],
'n_estimators':[5, 50, 250, 500]},
scoring = 'f1',
cv = 2),
},
{
        'classification_label': 'Gradient Boosting',
'classification_model': GridSearchCV(GradientBoostingClassifier(),
param_grid = {'learning_rate':[0.01, 0.1, 1, 10, 100],
'n_estimators':[5, 50, 250, 500]},
scoring = 'f1',
cv = 2),
}
]
for clf_md in classf_models:
    model = clf_md['classification_model']
    print(model)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Both the ROC curve and the AUC are computed from predicted
    # probabilities, not hard class labels
    false_rate, true_rate, thresholds = metrics.roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    auc_score = metrics.roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    plt.plot(false_rate, true_rate, label='%s ROC (area = %0.2f)' % (clf_md['classification_label'], auc_score))
plt.plot([0, 1], [0, 1], 'k--', label='Chance')
plt.xlim([0.0, 1.1])
plt.ylim([0.0, 1.1])
plt.xlabel('1 - Specificity (FPR)')
plt.ylabel('Sensitivity (TPR)')
plt.title('ROC Curve to determine the best model')
plt.legend(loc="lower right")
plt.show()
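One caveat when reading AUC values: `roc_auc_score` expects continuous scores (e.g. a `predict_proba` column); if it is given hard 0/1 predictions instead, it reduces to balanced accuracy, i.e. (TPR + TNR) / 2. A small sketch on toy data (illustrative, not this notebook's data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# AUC from probability scores: the true area under the ROC curve.
auc_scores = roc_auc_score(y, clf.predict_proba(X)[:, 1])
# AUC from hard labels: collapses to a single operating point,
# and equals balanced accuracy exactly.
auc_labels = roc_auc_score(y, clf.predict(X))
bal_acc = balanced_accuracy_score(y, clf.predict(X))
```

So an AUC computed from `model.predict(...)` summarizes only one threshold, while the probability-based AUC reflects the whole curve.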
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=10000,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=1, p=2,
weights='uniform')
GridSearchCV(cv=2, error_score=nan,
estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=None,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort='deprecated',
random_state=None,
splitter='best'),
iid='deprecated', n_jobs=None,
param_grid={'min_samples_split': [5, 10, 20, 30, 40, 50]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring='f1', verbose=0)
GridSearchCV(cv=2, error_score=nan,
estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
class_weight=None,
criterion='gini', max_depth=None,
max_features='auto',
max_leaf_nodes=None,
max_samples=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=100, n_jobs=None,
oob_score=False,
random_state=None, verbose=0,
warm_start=False),
iid='deprecated', n_jobs=None,
param_grid={'max_depth': [1, 2, 3], 'min_samples_split': [1, 2, 3],
'n_estimators': [200, 400]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring='f1', verbose=0)
GridSearchCV(cv=2, error_score=nan,
estimator=AdaBoostClassifier(algorithm='SAMME.R',
base_estimator=None,
learning_rate=1.0, n_estimators=50,
random_state=None),
iid='deprecated', n_jobs=None,
param_grid={'learning_rate': [0.01, 0.1, 1, 10, 100],
'n_estimators': [5, 50, 250, 500]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring='f1', verbose=0)
GridSearchCV(cv=2, error_score=nan,
estimator=GradientBoostingClassifier(ccp_alpha=0.0,
criterion='friedman_mse',
init=None, learning_rate=0.1,
loss='deviance', max_depth=3,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
n_estimators=100,
n_iter_no_change=None,
presort='deprecated',
random_state=None,
subsample=1.0, tol=0.0001,
validation_fraction=0.1,
verbose=0, warm_start=False),
iid='deprecated', n_jobs=None,
param_grid={'learning_rate': [0.01, 0.1, 1, 10, 100],
'n_estimators': [5, 50, 250, 500]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring='f1', verbose=0)
From the ROC curves above we can compare the performance of the different models. Ranked by AUC, the models that performed well are Decision Tree (0.94), Gradient Boosting (0.94), KNN (0.92), AdaBoost (0.88), Random Forest (0.85) and Logistic Regression (0.85).
Of these, Decision Tree (0.94), Gradient Boosting (0.94) and KNN (0.92) are the strongest choices, as they achieve the highest (and very similar) AUC scores.
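To make the final ranking explicit rather than reading it off the plot legend, the AUCs quoted above can be collected and sorted. A small sketch (values copied from the run above; labels are assumed):

```python
# AUC values as reported for each classifier in this section.
aucs = {
    'Decision Tree': 0.94,
    'Gradient Boosting': 0.94,
    'KNN': 0.92,
    'AdaBoost Model': 0.88,
    'Random Forest': 0.85,
    'Logistic Regression': 0.85,
}

# Sort descending by AUC; Python's sort is stable, so ties keep
# their original order.
ranked = sorted(aucs.items(), key=lambda kv: kv[1], reverse=True)
best_label, best_auc = ranked[0]
```

Storing the scores this way also makes it easy to add further metrics (precision, recall, F1) alongside AUC in one comparison table.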
If we had extra time, we would have run our models on the whole dataset (1.5 million rows); but as mentioned at the start, fitting even a single algorithm (Random Forest) on the full data ran for more than 15 hours without finishing.
Also, given extra time, there are many similar fraud-detection datasets available; we could have evaluated our models on those additional test sets to check whether they still perform as well as they do here.